Inference of genomic landscapes using ordered Hidden Markov Models with emission densities (oHMMed)

Background Genomes are inherently inhomogeneous, with features such as base composition, recombination, gene density, and gene expression varying along chromosomes. Evolutionary, biological, and biomedical analyses aim to quantify this variation, account for it during inference procedures, and ultimately determine the causal processes behind it. Since sequential observations along chromosomes are not independent, it is unsurprising that autocorrelation patterns have been observed e.g., in human base composition. In this article, we develop a class of Hidden Markov Models (HMMs) called oHMMed (ordered HMM with emission densities, the corresponding R package of the same name is available on CRAN): They identify the number of comparably homogeneous regions within autocorrelated observed sequences. These are modelled as discrete hidden states; the observed data points are realisations of continuous probability distributions with state-specific means that enable ordering of these distributions. The observed sequence is labelled according to the hidden states, permitting only neighbouring states that are also neighbours within the ordering of their associated distributions. The parameters that characterise these state-specific distributions are inferred. Results We apply our oHMMed algorithms to the proportion of G and C bases (modelled as a mixture of normal distributions) and the number of genes (modelled as a mixture of poisson-gamma distributions) in windows along the human, mouse, and fruit fly genomes. This results in a partitioning of the genomes into regions by statistically distinguishable averages of these features, and in a characterisation of their continuous patterns of variation. In regard to the genomic G and C proportion, this latter result distinguishes oHMMed from segmentation algorithms based in isochore or compositional domain theory. We further use oHMMed to conduct a detailed analysis of variation of chromatin accessibility (ATAC-seq) and epigenetic markers H3K27ac and H3K27me3 (modelled as a mixture of poisson-gamma distributions) along the human chromosome 1 and their correlations. Conclusions Our algorithms provide a biologically assumption free approach to characterising genomic landscapes shaped by continuous, autocorrelated patterns of variation. Despite this, the resulting genome segmentation enables extraction of compositionally distinct regions for further downstream analyses. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05751-4.


Comparing oHMMed to its Unordered Counterpart
The key feature of the oHMMed is its utilisation of convex emission densities, which then enables the ordering of hidden states.Recall that convexity is achieved by assuming a common standard deviation in the case of normal emissions, and a common shape parameter for the gamma-poisson emissions.Relaxing the convexity assumption yields a more general unordered HMM with emission densities which we will call unoHMMed.In this section, we first document the changes in the oHMMed algorithm required to implement unoHMMed, and then compare the performance of the methods.We do this on both simulated sequences that exhibit clear autocorrelation (i.e., were generated by oHMMed), as well as the human data from the main text.

Implementation of unoHMMed 1.Transition Matrix
To update the transition matrix T, the steps corresponding to the restriction of transitions and the maintenance of symmetry, i.e., the sampling and the addition of auxiliary variables from the Dirichlet distribution from Additional File 1, Section A1 (precisely: Step 3 of the oHMMed iteration procedure), can simply be excluded for both normal and gamma-poisson emission densities.

Normal Emissions
In the case of normal emission densities, the main difference between the ordered and unordered algorithms concerns the way in which the means and variances are sampled.Now that each state has a different standard deviation, these must be individually updated (compare to Additional File 1, Section A1, Step 2 of the oHMMed iteration procedure).Define: The posterior standard deviations can then be sampled as follows, with posterior degrees of freedom ν pi = 1 + L i and posterior scale parameters using a vector of priors σ 0 : By contrast, the means of each state can sampled as described for the oHMMed algorithms, and simply are not sorted (compare to Additional File 1, Section A1 , Step 2 of the oHMMed iteration procedure).

Gamma-Poisson Distribution
In the case of the gamma-poisson emission densities, the sampling of both the α and β parameters in the MCMC sampling scheme must be modified.Without assuming a common α but individual α i per state instead, the β i must be sampled from a Gamma distribution with the following posterior shape parameters per state: i + δ 0 (compare Additional File 1, Section A2 under 'Updating β i ' ).Again, the resulting, newly sampled β i are then not sorted.
For the individual α i themselves, the conditional posteriors utilized in the sampling procedure are proportional to (compare Additional File 1, Section A2 under 'Updating α' ): with where λ 0i is a vector of priors and Given the new α * i sampled from the jump distribution as was described in the main text, the decision rule ratio is applied to each of these separately.

Implementation Note
Finally, note that unoHMMed as described here was implemented as an equivalent to oHMMed rather than a fully fleshed out method in its own right.Further modifications unique to unoHMMed (such as tuning of updates, or reparameterising the models) may improve convergence behaviour.As such, the convergence traces generated by unoHMMed generally do not mix as well as those of oHMMed, particularly on long sequences and especially so for gammapoisson emission densities.Situations in which mean and median log-likelihood of a fitted model differ drastically (for examples for oHMMed see e.g.K = 6 for the human GC proportion in the main text, or K = 2 for the human gene content in the main text, and the variance in convergence traces are also very large then become the norm rather than a rare occurrence indicative of poor model specification; this can be seen explicitly in the evaluation of the loglikelihood of unoHMMed with poisson-gamma emissions on simulated sequences Fig. 4B.

Summary and Interpretation
The series of results presented in this subsection serves to emphasise that oHMMed was designed to infer a well-fitting model by assigning every one of a sequence of observations to a fixed number of hidden states with mean emission densities that can be discriminated with high statistical certainty.By contrast, unoHMMed appears to prioritise the model fit in terms of the overall emission density over clearly separated states.Recall, however, that the hidden states in this case are also not directly comparable using a single metric.For our simulations, we nonetheless sorted the states inferred by unoHMMed by mean emissions to facilitate the comparison of the methods.We chose to assess performance by collating the following results for each method across numerous inference runs on sequences that were simulated using the same underlying oHMMed model each time: the log-likelihood and the percentage of correctly assigned hidden states (which evaluate model fit), the deviation of the estimated values of the model parameters from the true values (which show general inference accuracy), and the number of inferred hidden states with significantly different mean emission densities.For the metrics of model fit and inference accuracy, the variances and means of the repetitions are compared between methods: F-tests are applied to check for difference in variances, and subsequently for the difference in mean either a t-test (when variances are equal) or Welch's test (when they are not) is employed.The numbers of distinctly different hidden states across repetitions are compared between methods via a χ 2 -test.We report the significant results (below Figs. 1, 2, 3, 4).-Overall, the main question here is: Given provable autocorrelation in an observed sequence, for e.g., along a genome, which method is best suited for handling it?
Generally, the simulations show that both methods yield very similar and appreciably accurate results in the case of normal emission densities; this is even the case with very short sequences (Fig. 1A,C,D), although performance is naturally improved with sequence length (Fig. 2A,C,D).With short sequences, the effect of oHMMed preferentially maintaining distinguishable hidden states (Fig. 1C) at the cost of overestimating the transition rates between states as well as the means of the state-specific emission densities (Fig. 1D) is more pronounced.At the same time, unoHHMed indeed tends to infer hidden states that are not statistically distinguishable by their mean emissions (Fig. 1C).This is largely because the estimated standard deviations per state are less accurate (with some being over-and others underestimated; Fig. 1E).
In the case of gamma-poisson emission densities, performance of oHMMed and unoHMMed differs more markedly, with oHMMed clearly outperforming unoHMMed in terms of overall model fit (Figs.3A,B and 4A,B), although this only becomes very significant for longer sequences.Similarly to before, oHMMed most often maintains the number of statistically distinct hidden states while un-oHMMed does not, especially for short sequences (Figs.3C, 4C).This difference is further reflected by slight underestimation of both α and the transition rates between hidden states by oHMMed, and overestimation of the β i by unoHMMed (Figs.3D, 4D).
The patterns concerning model fit and parameter estimation are noticeable in the comparison of oHMMed and unoHMMed on the hominid data (see Sec. 1.2.4).However, annotation of the genome according to both mean GC content and mean gene density yield very similar results between the methods since the full autosomal data (and therefore very long sequences) were used.Figure 1: Performance of both methods was tested on sequences of length 500.These were simulated with oHMMed under the assumption of five hidden states emitting normal distributions with means of 1,2,3,4,5, a shared standard distribution of 1, and a probability of 0.1 of transitioning to any permitted neighbouring state.Then, both oHMMed and unoHMMed were used for inference; 2000 iterations (1000 of which were burn-in) were performed with prior and initial values as set per the usage recommendations for oHMMed on GitHub [1] .This sequence of simulation and inference was repeated 400 times, and the above figure evaluates the results across these repetitions.Note that although unoHMMed does not immediately allow ordering of hidden states, we sorted them by increasing mean emissions to facilitate comparison to oHMMed.In panel A and panel B, we show boxplots of the percentage of correctly assigned hidden states and the overall log-likelihood; means and variances of both measures are clearly non-significant between the two methods.Panel D shows barplots of how many of the mean emissions were deemed significantly different by one-sided t-tests (confidence level 0.95) after ordering the hidden states by increasing mean; the distribution of these counts differs significantly between methods (χ 2 = 36.311,simulated p-val < 0.05).

Simulations with Normal Emissions
In Panel E, boxplots of the true values minus the inferred values for all the parameters are shown (averages in the cases of state-specific parameters).Between the two methods, the estimates of the mean differ significantly in both variance and means (F-val= 21.363 with df= 1, p-val< 0.05, and Welch t-val= 4.329 with df= 436.27, p-val< 0.05); the same is true for the estimated of the transition rates (F-val= 2.173e 31 with df= 1, p-val< 0.05, and Welch t-val= 18.731 with df= 399, p-val< 0.05).The estimates for the standard deviation differ only in variance (F-val= 4.975e −33 with df= 1, p-val< 0.05).Figure 3: Performance of both methods was tested on sequences of length 2x10 3 .These were simulated with oHMMed under the assumption of three hidden states emitting gamma-poisson distributions with shared α = 1.5, respective β i of 0.029, 0.048, and a probability of 0.1 of transitioning to any permitted neighbouring state.Then, both oHMMed and unoHMMed were used for inference; 2000 iterations (1000 of which were burn-in) were performed with prior and initial values as set per the usage recommendations for oHMMed on GitHub [1].This sequence of simulation and inference was repeated 12 times, and the above figure evaluates the results across these repetitions.Note that although unoHMMed does not immediately allow ordering of hidden states, we sorted them by increasing mean emissions to allow comparison to oHMMed.Note that since there were few repetitions of the testing procedure, the boxplots from panels A,B,E are superimposed on the individual values of the respective measure.In panel A and panel B, we show boxplots of the percentage of correctly assigned hidden states and the overall log-likelihood; the former measure differs significantly in mean between the two methods (Student's t-val=7.019with df=22, p-val < 0.05).Panel C shows barplots of how many of the mean emissions of the ordered states were deemed significantly different by one-sided poisson rate tests with confidence level 0.95; the distribution of these counts differs significantly (χ 2 = 14.4,simulated p-val < 0.05).In Panel D, boxplots of the true values minus the inferred values for all the parameters are shown (averages in the cases of statespecific parameters).Between the two methods, the estimates of β differ significantly in both variance and mean (F-val= 2.218e −32 with df= 1, p-val< 0.05, and Welch t-val= −20.443 with df= 11, p-val< 0.05); the same is true for the estimates of the transition rates (F-val= 1.650e 29 with df= 1, p-val< 0.05, and Welch t-val= −2.639 with df= 11, p-val< 0.05).The estimates of α differ only in variance (F-val= 59.358 with df= 1, p-val< 0.05).4) and increased the sequence length to 3x10 5 .We then performed 12 repetitions of the simulation and inference procedure.The above figure summarises the results across the repetitions.In panels A and B, one sees that the percentage of correctly assigned hidden states and the overall log-likelihood differ significantly in mean between the two methods, with oHMMed also being more precise in terms of the log-likelihood (respectively: Student's t-val= 149.9 with df= 22, p-val< 0.05; and F-val= 0.112 with df= 1, p-val< 0.05, and Welch t-val= 25.689 with df= 13.431, p-val< 0.05 respectively).Further, panel D reveals that between the two methods, the estimates of β differ significantly in both variance and mean (F-val= 7.138e −33 with df= 1, p-val< 0.05, and Welch t-val= −3.401 with df= 11, p-val< 0.05); the same is true for the estimated of the transition rates (F-val= 1.143e 31 with df= 1, p-val< 0.05, and Welch t-val= −5776 with df= 11, p-val< 0.05).The estimates of α differ only in variance (F-val= 59.358 with df= 1, p-val< 0.05).

Figure 2 :
Figure 2: Keeping the same general set-up as in Fig. (1), we increased the sequence length to 2 12 and performed 30 repetitions of the simulation and inference procedure.The above figure summarises the results across the repetitions.Since fewer of these were performed, the boxplots in panels A,B,E are superimposed on the individual values.The only significant difference in performance between the two methods lies in the variance of the estimators of the mean, standard deviation and transition rates (respectively: F-val= 78.908 with df= 1, p-val< 0.05; and F-val= 1.443e −33 with df= 1, p-val< 0.05; and F-val= 1.244e 31 with df= 1, p-val< 0.05).

Figure 4 :
Figure 4: Here, we retained the set-up from Fig. (4) and increased the sequence length to 3x10 5 .We then performed 12 repetitions of the simulation and inference procedure.The above figure summarises the results across the repetitions.In panels A and B, one sees that the percentage of correctly assigned hidden states and the overall log-likelihood differ significantly in mean between the two methods, with oHMMed also being more precise in terms of the log-likelihood (respectively: Student's t-val= 149.9 with df= 22, p-val< 0.05; and F-val= 0.112 with df= 1, p-val< 0.05, and Welch t-val= 25.689 with df= 13.431, p-val< 0.05 respectively).Further, panel D reveals that between the two methods, the estimates of β differ significantly in both variance and mean (F-val= 7.138e −33 with df= 1, p-val< 0.05, and Welch t-val= −3.401 with df= 11, p-val< 0.05); the same is true for the estimated of the transition rates (F-val= 1.143e 31 with df= 1, p-val< 0.05, and Welch t-val= −5776 with df= 11, p-val< 0.05).The estimates of α differ only in variance (F-val= 59.358 with df= 1, p-val< 0.05).